Selecting a loss function is one of the crucial decisions you make before training a model, and it is often unclear which one suits a particular problem. In this article, I’ll explain what loss functions are and introduce some of the most popular ones to choose from. More specifically, we will learn:
The significance of loss functions in the machine learning pipeline and why we need them
Popular loss functions for classification problems
Popular loss functions for regression problems
How to select the most appropriate loss function for a particular task
Machine learning is the process of learning from data to find the solution to a problem. Ideally, the dataset we have has labels, making it a supervised problem. The learning process uses the training data to produce a predictor function that maps an input vector ‘x’ to the ground truth ‘y’. We want the predictor to work even for examples it hasn’t seen in the training data; that is, we want it to generalize as well as possible. And because we want it to generalize, we have to design it in a principled, mathematical way.
For each data point ‘x’, the model computes a series of operations to produce a predicted output. We then compare the predicted output to the actual output ‘y’ to produce an error value. That error is what we minimize during the learning process using an optimization strategy like gradient descent.
We compute that error value using a loss function. A loss function measures the error the model makes when it produces a prediction for input ‘x’ and the correct output is ‘y’.
Unfortunately, there is no universal loss function that works for all kinds of data. Many factors affect the choice of loss function, such as the presence of outliers, the machine learning algorithm being used, the speed-accuracy trade-off, and so on.
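To make this concrete, here is a minimal sketch of that loop in NumPy. The tiny linear predictor, the made-up data, and the squared-error loss are all placeholder assumptions chosen just to show where the loss function and gradient descent fit in:

```python
import numpy as np

# Toy data: inputs x and ground-truth labels y (made up for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0    # single parameter of a tiny linear predictor: y_hat = w * x
lr = 0.01  # learning rate for gradient descent

for step in range(100):
    y_hat = w * x                        # predicted output for each x
    loss = np.mean((y_hat - y) ** 2)     # loss function compares prediction to ground truth
    grad = np.mean(2 * (y_hat - y) * x)  # gradient of the loss with respect to w
    w -= lr * grad                       # gradient descent step minimizes the error

print(w, loss)
```

Whatever model and loss you pick, the structure stays the same: predict, measure the error with the loss function, and update the parameters to reduce it.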
Now, let’s discuss some of the most popular loss functions for a classification task.
If you have trained a classifier using a deep learning framework like Keras, you have probably already used this loss function, since cross-entropy loss is the default loss function for classification tasks. It originates from information theory and uses the concept of information entropy to measure how far one probability distribution deviates from another. But don’t worry: to use this loss function effectively in your machine learning model, you don’t need to work through the concepts of information theory. Just keep in mind that an ideal model with no errors has a cross-entropy loss of 0.
The cross-entropy loss rises as the predicted probability deviates from the ground truth. As the predicted probability for the correct class approaches 1, the log loss slowly decreases, but as the predicted probability falls, the log loss grows very fast. Thus log loss penalizes not only obviously inaccurate predictions but also less confident ones (probability less than 1), and it especially punishes predictions that are confidently wrong.
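As a rough illustration, here is a small NumPy sketch of binary cross-entropy (log loss) evaluated at a few predicted probabilities for a positive example; the helper function and the probability values are assumptions made purely for demonstration:

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Cross-entropy (log loss) for a single prediction.
    p: predicted probability of the positive class, y: true label (0 or 1)."""
    eps = 1e-12                    # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Loss for a positive example (y = 1) at different predicted probabilities
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:>4}: loss = {binary_cross_entropy(p, 1):.4f}")
# The loss shrinks slowly as p approaches 1 but blows up quickly as p approaches 0
```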
There is another commonly used loss function for classification tasks called the hinge loss. It was originally developed for binary classification, but extensions for multi-class classification were created later. The following is the definition of hinge loss for binary classification problems, where target values can only come from the set {-1, 1}:
![hinge loss equation](https://my-static-images.s3.us-east-2.amazonaws.com/article_3_hinge_loss.png)
As the equation shows, hinge loss penalizes predictions not only when they are incorrect but also when they are correct yet not confident enough, i.e., when the product of the score and the label is less than 1. The loss is 0 only when the signs match and that product is greater than or equal to 1. For example, if our score for a specific data point is 0.3 but the label is -1, we get a penalty of 1.3. If the score is -0.7 and the label is -1, we still get a penalty of 0.3, but if we predict -1.7 we get no penalty.
Hinge loss is cheaper to compute than the cross-entropy loss. It is also faster to train via gradient descent, since much of the time the gradient is 0 and the weights don’t need to be updated. If you need to make real-time decisions and can tolerate somewhat lower accuracy, go with the hinge loss over the cross-entropy loss. If accuracy matters more than speed, go with the cross-entropy loss. It is a trade-off.
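Here is a short sketch that reproduces those three cases with the standard binary hinge loss, max(0, 1 - label * score); the function name and the example scores are just for illustration:

```python
def hinge_loss(score, label):
    """Binary hinge loss; label must be -1 or +1."""
    return max(0.0, 1.0 - label * score)

# The three cases from the text, all with true label -1
for score in [0.3, -0.7, -1.7]:
    print(f"score = {score:>5}: loss = {hinge_loss(score, -1):.1f}")
# Output: 1.3, 0.3, 0.0
```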
Now, let’s switch to some of the most popular loss functions for a regression problem.
One of the most popular loss functions for regression tasks is the mean square error (MSE) loss. It measures the average squared difference between the model’s predictions and the correct values, so we can think of it as a measure of the model’s performance on the training set. Broadly, we perform the following operations on each batch of the training dataset to get the MSE loss for that batch (a short code sketch follows the steps):
For all items in a batch, find the difference between predicted and actual value.
Square all these differences.
Add them all up.
Divide this sum by the total number of items in the batch.
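Below is a minimal NumPy sketch of those four steps; the batch values are made up purely for illustration:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    diffs = y_pred - y_true      # 1. difference between predicted and actual values
    squared = diffs ** 2         # 2. square all these differences
    total = squared.sum()        # 3. add them all up
    return total / len(y_true)   # 4. divide by the number of items in the batch

# Example batch (made-up numbers)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse_loss(y_pred, y_true))  # 0.375
```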
The square is in there because it makes the loss quadratic, and therefore convex. When we plot a quadratic function, it has a U shape with a single minimum. So when we use an optimization strategy like gradient descent, we won’t get stuck in a local minimum; we’ll find the global minimum, which ultimately gives us the parameter values that optimize the objective function.
Another popular loss function for regression is the mean absolute error (MAE) loss. It is similar to MSE loss except that it does not square the difference between the predicted and actual output; instead, it takes the absolute value of the difference. In other words, MAE does not take the direction (or sign) of the error into account. The major difference between MAE and MSE is that the latter emphasizes very large deviations: the square of a very small number is even smaller, while the square of a really big number is huge. This means that MAE loss is more robust to outliers than MSE loss, so use it if you have a lot of anomalies in your dataset. MAE gives equal weight to all data points, whereas MSE focuses on the extremes.
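The sketch below contrasts MAE and MSE on a made-up batch containing a single outlier; the numbers are only illustrative, but they show how the squared term lets one extreme error dominate the MSE:

```python
import numpy as np

def mae_loss(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))   # absolute value ignores the sign

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)    # squaring emphasizes large errors

y_true = np.array([3.0, 3.0, 3.0, 3.0, 3.0])
y_pred = np.array([2.5, 3.5, 3.0, 2.0, 13.0])  # last prediction is an outlier

print("MAE:", mae_loss(y_pred, y_true))  # 2.4  -- grows only linearly with the outlier
print("MSE:", mse_loss(y_pred, y_true))  # 20.3 -- dominated by the squared outlier
```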
So, depending on whether your problem is a regression or a classification problem, you can pick one of the loss functions described above, and within each category you can choose the one that best fits your priorities, whether that is speed, accuracy, or robustness to outliers.